In this mini-project, we shall use binary classification to classify an image as cancerous (class label $1$) or benign (class label $0$), i.e., to identify metastatic cancer in small image patches taken from larger digital pathology scans (this problem is taken from a past Kaggle competition).
PCam packs the clinically relevant task of metastasis detection into a straightforward binary image classification task, which our CNN models will perform. The models are trained on a Colab GPU on a given set of images and then predict / detect tumors in a set of unseen test images.
The dataset provides a large number of small pathology images to classify. Files are named with an image id. The train_labels.csv file provides the ground truth for the images in the train folder. The goal is to predict the labels for the images in the test folder. A positive label indicates that the center $32\times 32$px region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label.
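The labeling rule can be made concrete with a small sketch: for a $96 \times 96$ patch stored as a NumPy array, the label depends only on the central $32 \times 32$ window (the dummy array below stands in for a real PCam image; the slicing offsets follow from the stated sizes).

```python
import numpy as np

# a dummy 96x96 RGB patch standing in for a real PCam image
patch = np.zeros((96, 96, 3), dtype=np.uint8)

# the label depends only on the central 32x32 region:
# offset = (96 - 32) // 2 = 32, so rows/cols 32..63 (inclusive)
offset = (96 - 32) // 2
center = patch[offset:offset + 32, offset:offset + 32, :]
print(center.shape)  # (32, 32, 3)
```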
First we need to import all the Python packages / functions required for building the CNN models. We shall use tensorflow / keras to train the deep learning models.
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
import seaborn as sns
import os
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, MaxPool2D, BatchNormalization, Flatten, Dropout, Dense, Activation
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import visualkeras
from tifffile import imread
As can be seen, the number of images in the training folder (ones with ground-truth labels) and the test folder (without ground-truth labels) are around 220k and 57k, respectively.
# count number of training and test images
base_dir = 'histopathologic-cancer-detection'
train_dir, test_dir = f'{base_dir}/train/', f'{base_dir}/test/'
ntrain, ntest = len(os.listdir(train_dir)), len(os.listdir(test_dir))
print(f'#training images = {ntrain}, #test images = {ntest}')
The train_labels.csv file is loaded as a pandas DataFrame (the first few rows of the dataframe are shown below). It contains the ids of the tif image files from the training folder, along with their ground-truth labels. Let's have the file names in the id column instead, by concatenating the .tif extension to the id values. This will turn out to be useful later when reading the files from the folder automatically.
# read the training images and ground truth labels into a dataframe
train_df = pd.read_csv(f'{base_dir}/train_labels.csv')
train_df['label'] = train_df['label'].astype(str)
train_df['id'] = train_df['id'] + '.tif'
train_df.head()
The images are RGB color images of shape $96 \times 96$, as seen from the next code snippet, where an image is loaded using the imread() function from tifffile.
imread(f'{train_dir}/{train_df.id[0]}').shape
Also, as can be seen from the plot below, there are around 130k benign and 89k cancerous images in the training dataset, hence the dataset is not heavily imbalanced; the dataset is also large. Hence we are not using preprocessing steps such as augmentation.
# histogram of class labels
sns.displot(data=train_df, x='label', hue='label')
# labels are kept as strings, as required later by flow_from_dataframe with class_mode='binary'
train_df['label'].value_counts()
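Even though the imbalance is mild, class weights could be passed to `fit()` as a lightweight alternative to augmentation. A minimal sketch with the approximate counts quoted above (the exact counts come from `value_counts()`; these figures are only illustrative):

```python
# approximate class counts from the training set (benign ~130k, cancerous ~89k)
n_benign, n_cancer = 130_000, 89_000
total = n_benign + n_cancer

# inverse-frequency weights, normalized so an average sample has weight ~1
class_weight = {0: total / (2 * n_benign), 1: total / (2 * n_cancer)}
print(class_weight)
# these could then be passed as model.fit(..., class_weight=class_weight)
```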
Next let's plot 100 randomly selected benign and cancerous images (with labels 0 and 1, respectively) from the training dataset to visually inspect the difference, if any.
def plot_images(df, label, n=100):
    # sample n images with the given label and show them in a 10x10 grid
    df_sub = df.loc[df.label == label].sample(n)
    imfiles = df_sub['id'].values
    plt.figure(figsize=(15,15))
    for i in range(n):
        im = imread(f'{train_dir}/{imfiles[i]}')
        plt.subplot(10, 10, i+1)
        plt.imshow(im)
        plt.axis('off')
    plt.suptitle(f'sample train images with label = {label}', size=20)
    plt.tight_layout()
    plt.show()
# sample and plot images from different classes
for label in train_df.label.unique():
    plot_images(train_df, label)
Quite a few models were trained to classify the images, starting from simpler baseline models to more complex models.
We shall implement the model using keras model subclassing with configurable reusable blocks (so that we can reuse them and change the hyperparameters) and functional APIs.
The models will be traditional CNNs, comprising a few convolution/pooling blocks (to obtain translation invariance, increase the field of view, reduce the number of parameters, etc.) followed by a Flatten layer, a Dense layer and a binary classification layer.
There will be a convolution block, implemented by the class ConvBlock, which will contain two Conv2D layers of the same size followed by a MaxPool2D layer. There can be an optional BatchNormalization layer (to reduce internal covariate shift between batches) immediately following each of the Conv2D layers. The number of filters, kernel size, activation, pooling size and whether batch norm is present will be configurable (with default values) and can be specified when instantiating the block.
The next reusable block will be the TopBlock, containing a Flatten layer followed by two Dense layers, the last of which is the classification layer with a single output neuron. The size of the first dense layer will be configurable.
By default, ReLU non-linear activation will be used in the layers, except in the last classifier layer, which will use sigmoid activation (to keep the output in the range $[0,1]$, which can be interpreted as the probability that an image contains tumor cells).
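As a quick illustration of why sigmoid suits the classifier layer, it squashes any real-valued logit into $(0,1)$, and thresholding at $0.5$ recovers a hard class label:

```python
import numpy as np

def sigmoid(z):
    # maps any real logit to the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([-4.0, 0.0, 4.0])
probs = sigmoid(logits)
print(probs.round(3))  # [0.018 0.5   0.982]
# thresholding at 0.5 recovers the predicted class labels
print((probs >= 0.5).astype(int))  # [0 1 1]
```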
The model is shown below:

The batch_size and im_size will also be hyperparameters that can be tuned / changed. We shall use a batch size of 256, and all images will be resized to $64 \times 64$ to save memory.
batch_size, im_size = 256, (64, 64)  # im_size as a (height, width) tuple, as required by target_size later
class ConvBlock(tf.keras.layers.Layer):
    '''
    implements ConvBlock as
    Conv2D -> Conv2D -> MaxPool OR
    Conv2D -> BatchNorm -> Conv2D -> BatchNorm -> MaxPool
    as a reusable block in the CNN to be created
    '''
    def __init__(self, n_filter, kernel_sz=(3,3), activation='relu', pool_sz=(2,2), batch_norm=False):
        # initialize with different hyperparameter values, such as the number of filters
        # in Conv2D, kernel size, activation, and whether BatchNorm is present or not
        super(ConvBlock, self).__init__()
        self.batch_norm = batch_norm
        # the activation is applied after the (optional) BatchNorm,
        # so the Conv2D layers themselves carry no activation
        self.conv_1 = Conv2D(n_filter, kernel_sz)
        self.bn_1 = BatchNormalization()
        self.conv_2 = Conv2D(n_filter, kernel_sz)
        self.bn_2 = BatchNormalization()
        self.activation = Activation(activation)
        self.pool = MaxPool2D(pool_size=pool_sz)

    def call(self, x):
        # forward pass
        x = self.conv_1(x)
        if self.batch_norm:
            x = self.bn_1(x)
        x = self.activation(x)
        x = self.conv_2(x)
        if self.batch_norm:
            x = self.bn_2(x)
        x = self.activation(x)
        return self.pool(x)
class TopBlock(tf.keras.layers.Layer):
    '''
    implements the top layers of the CNN
    Flatten -> Dense -> Classification
    as a reusable block
    '''
    def __init__(self, n_units=256, activation='relu', drop_out=False, drop_rate=0.5):
        # initialize the block with configurable hyperparameter values, e.g., the number
        # of neurons in the Dense layer, activation, and whether Dropout is present or not
        super(TopBlock, self).__init__()
        self.drop_out = drop_out
        self.flat = Flatten()
        self.dropout = Dropout(drop_rate)
        self.dense = Dense(n_units, activation=activation)
        self.classifier = Dense(1, activation='sigmoid')

    def call(self, x, training=False):
        # forward pass (the Dropout layer is a no-op unless training=True)
        x = self.flat(x)
        if self.drop_out:
            x = self.dropout(x, training=training)
        x = self.dense(x)
        return self.classifier(x)
We shall create a couple of models by tuning the hyperparameters (Conv2D filter size and Dense layer size), with / without batch normalization:
CNNModel1: the first one without BatchNormalization, with ConvBlock sizes $16$ and $32$, respectively, and dense layer size $256$, as shown in the next code snippet.
class CNNModel1(tf.keras.Model):
    def __init__(self, input_shape=(im_size[0], im_size[1], 3), n_class=1):
        super(CNNModel1, self).__init__()
        # the first conv module
        self.conv_block_1 = ConvBlock(16)
        # the second conv module
        self.conv_block_2 = ConvBlock(32)
        # model top
        self.top_block = TopBlock(n_units=256)

    def call(self, inputs, training=False, **kwargs):
        # forward pass
        x = self.conv_block_1(inputs)
        x = self.conv_block_2(x)
        return self.top_block(x, training=training)
# instantiate and build the model without BatchNorm
model1 = CNNModel1()
model1.build(input_shape=(batch_size,im_size[0],im_size[1],3))
model1.summary()
The model architecture looks like the following (the model can be defined using keras Sequential too):

CNNModel2: the second one with BatchNormalization, with ConvBlock sizes $32$ and $32$, respectively, and dense layer size $512$, defined / instantiated using the next code snippet.
class CNNModel2(tf.keras.Model):
    def __init__(self, input_shape=(im_size[0], im_size[1], 3), n_class=1):
        super(CNNModel2, self).__init__()
        # the first conv module
        self.conv_block_1 = ConvBlock(32, batch_norm=True)
        # the second conv module
        self.conv_block_2 = ConvBlock(32, batch_norm=True)
        # model top
        self.top_block = TopBlock(n_units=512)

    def call(self, inputs, training=False, **kwargs):
        # forward pass
        x = self.conv_block_1(inputs)
        x = self.conv_block_2(x)
        return self.top_block(x, training=training)
# instantiate and build the model with BatchNorm
model2 = CNNModel2()
model2.build(input_shape=(batch_size, im_size[0], im_size[1], 3))
model2.summary()
We shall also use a few popular architectures, namely VGG16, VGG19 and ResNet50, removing their top layers and adding a couple of layers to each for the classification.
base_model = tf.keras.applications.VGG16(
    input_shape=(im_size[0], im_size[1], 3),
    include_top=False,
    weights='imagenet'
)
#color_map = get_color_map()
#visualkeras.layered_view(base_model, color_map=color_map, legend=True)
np.random.seed(1)
tf.random.set_seed(1)
model_vgg16 = Sequential([
    base_model,
    Flatten(),
    BatchNormalization(),
    Dense(16, activation='relu'),
    Dropout(0.3),
    Dense(8, activation='relu'),
    Dropout(0.3),
    BatchNormalization(),
    Dense(1, activation='sigmoid')
], name='vgg16_backbone')
model_vgg16.summary()
base_model = tf.keras.applications.VGG19(
    input_shape=(im_size[0], im_size[1], 3),
    include_top=False,
    weights='imagenet'
)
np.random.seed(1)
tf.random.set_seed(1)
model_vgg19 = Sequential([
    base_model,
    Flatten(),
    BatchNormalization(),
    Dense(16, activation='relu'),
    Dropout(0.3),
    Dense(8, activation='relu'),
    Dropout(0.3),
    BatchNormalization(),
    Dense(1, activation='sigmoid')
], name='vgg19_backbone')
model_vgg19.summary()
base_model = tf.keras.applications.ResNet50(
    input_shape=(im_size[0], im_size[1], 3),
    include_top=False,
    weights='imagenet'
)
np.random.seed(1)
tf.random.set_seed(1)
model_resnet50 = Sequential([
    base_model,
    Flatten(),
    BatchNormalization(),
    Dense(16, activation='relu'),
    Dropout(0.5),
    Dense(8, activation='relu'),
    Dropout(0.5),
    BatchNormalization(),
    Dense(1, activation='sigmoid')
], name='resnet50_backbone')
model_resnet50.summary()
Now we need to read the images. To do this automatically during the training phase, we shall use ImageDataGenerator with the flow_from_dataframe() method, and hold out 25% of the training images for validation performance evaluation.
# scale the images to have pixel values in between [0-1]
# create 75-25 train-validation split of the training dataset for model evaluation
generator = ImageDataGenerator(rescale=1./255, validation_split=0.25)
train_data = generator.flow_from_dataframe(
    dataframe=train_df,
    x_col='id',      # filenames
    y_col='label',   # labels
    directory=train_dir,
    subset='training',
    class_mode='binary',
    batch_size=batch_size,
    target_size=im_size)
val_data = generator.flow_from_dataframe(
    dataframe=train_df,
    x_col='id',      # filenames
    y_col='label',   # labels
    directory=train_dir,
    subset='validation',
    class_mode='binary',
    batch_size=batch_size,
    target_size=im_size)
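With a 75/25 split of roughly 220k labelled images and a batch size of 256, the number of batches per epoch works out as follows (220,000 is an approximation of the folder count reported earlier; the generators print the exact sample counts themselves):

```python
import math

n_train_total = 220_000           # approximate size of the labelled set
batch_size = 256
n_fit = int(n_train_total * 0.75) # training subset after the 75/25 split
n_val = n_train_total - n_fit

# number of (possibly partial) batches drawn per epoch
steps_per_epoch = math.ceil(n_fit / batch_size)
val_steps = math.ceil(n_val / batch_size)
print(steps_per_epoch, val_steps)  # 645 215
```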
The model without the BatchNormalization layers is trained first. The Adam optimizer is used with learning_rate=0.0001 (higher learning rates seemed to diverge), with binary cross-entropy (BCE) as the loss function, trained for 10 epochs. We shall use the BCE loss and accuracy metrics on the held-out validation dataset for model evaluation.
# compile and fit the first model with Adam optimizer, BCE loss and to be evaluated with accuracy metric
opt = Adam(learning_rate=0.0001)
model1.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
hist = model1.fit(train_data, validation_data=val_data, epochs=10)
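The loss being minimized is binary cross-entropy; a hand computation on a toy batch matches what `tf.keras.losses.binary_crossentropy` returns (the labels and probabilities below are made up for illustration):

```python
import numpy as np
import tensorflow as tf

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])   # sigmoid outputs

# BCE = -mean(y*log(p) + (1-y)*log(1-p))
manual = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

keras_bce = float(tf.keras.losses.binary_crossentropy(y_true, y_pred).numpy())
print(round(manual, 4), round(keras_bce, 4))  # both ≈ 0.2798
```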
As can be seen, the validation loss went as low as 0.3611 and the validation accuracy went up to 84%. The next figure shows how both the training and validation loss decreased over epochs, whereas both the training and validation accuracy increased steadily during training.
def plot_hist(hist):
    # plot training / validation accuracy and loss curves side by side
    plt.figure(figsize=(10,5))
    plt.subplot(121)
    plt.plot(hist.history['accuracy'])
    plt.plot(hist.history['val_accuracy'])
    plt.legend(['Accuracy', 'Validation Accuracy'])
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.grid()
    plt.subplot(122)
    plt.plot(hist.history['loss'])
    plt.plot(hist.history['val_loss'])
    plt.title('Model Evaluation')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.grid()
    plt.legend(['Loss', 'Validation Loss'])
    plt.tight_layout()
    plt.show()
plot_hist(hist)

Next we trained the model with batch normalization enabled and the other hyperparameters tuned as described above. The optimizer and the number of epochs were the same as above.
# compile and fit the second model with Adam optimizer, BCE loss and to be evaluated with accuracy metric
opt = Adam(learning_rate=0.0001)
model2.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
hist = model2.fit(train_data, validation_data=val_data, epochs=10)
As can be seen, the validation loss went down to 0.4302 and the validation accuracy went up to 85%. The next figure shows how the training loss / accuracy steadily decreased / increased over epochs, respectively, but the validation loss / accuracy became unstable.
plot_hist(hist)

We shall attempt transfer learning / fine-tuning (starting from pretrained ImageNet weights and keeping the VGG16 backbone weights frozen) and also train VGG16 from scratch, and compare the results obtained in terms of accuracy and ROC AUC (area under the ROC curve).
opt = tf.keras.optimizers.Adam(0.001)
base_model.trainable = False
model_vgg16.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy', tf.keras.metrics.AUC()])
hist = model_vgg16.fit(train_data, epochs = 20, validation_data = val_data, verbose=1)
plot_hist(hist)
base_model.trainable = True
model_vgg16.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy', tf.keras.metrics.AUC()])
%%time
hist = model_vgg16.fit(train_data, epochs = 20, validation_data = val_data, verbose=1)
plot_hist(hist)
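A common refinement when unfreezing a pretrained backbone (not applied above, so this is only a suggested variant) is to recompile with a much smaller learning rate, so that the fine-tuning steps do not wash out the pretrained features. A self-contained sketch (using `weights=None` as a stand-in so it runs without downloading ImageNet weights):

```python
import tensorflow as tf
from tensorflow.keras import layers, Sequential

# stand-in for a pretrained backbone plus a small classification head
backbone = tf.keras.applications.VGG16(include_top=False, weights=None,
                                       input_shape=(64, 64, 3))
model = Sequential([backbone, layers.Flatten(),
                    layers.Dense(1, activation='sigmoid')])

# phase 1: frozen backbone, larger learning rate for the new head
backbone.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='binary_crossentropy')

# phase 2: unfreeze and recompile with a ~10x smaller learning rate
backbone.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss='binary_crossentropy')
```

Changing `trainable` only takes effect after `compile()` is called again, which is why the model is recompiled in each phase.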
The next model (the VGG19 backbone trained from scratch) is the one that performed best, obtaining a ~88% public ROC score on the unseen test dataset.
opt = tf.keras.optimizers.Adam(0.001)
base_model.trainable = True
model_vgg19.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy', tf.keras.metrics.AUC()])
%%time
hist = model_vgg19.fit(train_data, epochs = 20, validation_data = val_data, verbose=1)
model_vgg19.save('model_vgg19_20.h5')
plot_hist(hist)
base_model.trainable = True
opt = tf.keras.optimizers.Adam(0.001)
model_resnet50.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy', tf.keras.metrics.AUC()])
%%time
hist = model_resnet50.fit(train_data, validation_data=val_data, epochs=20, verbose=1)
plot_hist(hist)
As seen from the above figure, the ResNet50 model's accuracy and loss both fluctuate considerably and are not stable; the model needs more training.
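One hedge against such fluctuating validation curves (not tried here) would be learning-rate scheduling with the `ReduceLROnPlateau` callback, optionally combined with early stopping; the factor / patience values below are illustrative defaults, not tuned:

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

# halve the learning rate whenever val_loss fails to improve for 2 epochs
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                              patience=2, min_lr=1e-6, verbose=1)
# optionally stop once val_loss stops improving altogether
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)

# these would be passed as:
# model_resnet50.fit(train_data, validation_data=val_data,
#                    epochs=20, callbacks=[reduce_lr, early_stop])
```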
Again, the test image ids were stored in a dataframe and the test images were automatically read during the prediction phase with flow_from_dataframe(), as before. The predictions are submitted to Kaggle to obtain the score (although leaderboard selection was disabled).
images_test = pd.DataFrame({'id': os.listdir(test_dir)})
generator_test = ImageDataGenerator(rescale=1./255)  # scale the test images to have pixel values in [0-1]
test_data = generator_test.flow_from_dataframe(
    dataframe=images_test,
    x_col='id',  # filenames
    directory=test_dir,
    class_mode=None,
    batch_size=1,
    target_size=im_size,
    shuffle=False)
# predict with the model
predictions = model1.predict(test_data, verbose=1)
predictions = predictions.squeeze()
predictions.shape
# create submission dataframe for kaggle submission
submission_df = pd.DataFrame()
submission_df['id'] = images_test['id'].apply(lambda x: x.split('.')[0])
submission_df['label'] = list(map(lambda x: 0 if x < 0.5 else 1, predictions))
submission_df['label'].value_counts()
submission_df.to_csv('submission2.csv', index=False)
print(submission_df.head())
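Since the competition metric is ROC AUC, which only depends on the ordering of the scores, submitting the raw sigmoid probabilities instead of the 0/1 thresholded labels would preserve the ranking information the metric uses. A minimal sketch (the ids and predictions below are dummy stand-ins for the real test outputs):

```python
import numpy as np
import pandas as pd

# dummy ids and sigmoid outputs standing in for the real test predictions
ids = ['abc123', 'def456', 'ghi789']
predictions = np.array([0.97, 0.03, 0.51])

# AUC depends only on the ordering of the scores, so raw probabilities suffice
submission_df = pd.DataFrame({'id': ids, 'label': predictions})
submission_df.to_csv('submission_probs.csv', index=False)
print(submission_df)
```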
With the first baseline model (CNNModel1), a ~82.7% ROC score was obtained on the unseen test dataset on Kaggle, as shown below, which was higher than the one obtained with the second baseline model (CNNModel2).
The ROC scores obtained with the VGG16 and VGG19 backbones were much better, and the results obtained by training them from scratch were better than those obtained with transfer learning / fine-tuning (since we have a huge dataset) - the best score was obtained with the VGG19 backbone trained from scratch.

As we could see, the CNN model without BatchNormalization (CNNModel1) outperformed the one with it (CNNModel2), given that a small number of epochs (namely 10) was used to train both models. It is likely that CNNModel2 would generalize better to the unseen test images if it were trained for more epochs (e.g., 30).
We also trained popular CNN architectures such as VGG16/19 and ResNet50, both with pre-trained ImageNet weights (transfer learning / fine-tuning) and from scratch, and obtained much better accuracy, particularly with the models trained from scratch. We could also try more recent and complex models such as ResNet101, InceptionV3 or EfficientNet.